CS432 - Fall 2001
Assignment 3 - B+-Tree
Deadline: October 23, 23:59pm

Check this page and the newsgroup frequently for updates. Here is a list of frequently asked questions. This is the most difficult assignment we have in this course. Start early on and proceed in steps. Read this assignment and the list of frequently asked questions carefully before you start.

Goal
In this assignment, you are asked to implement a B+-tree index file. Your implementation does not have to deal with duplicates that span multiple pages. You can get an extra 10% (2% of your overall grade since the assignment has a weight of 20%) if your group implements a B+-Tree that can deal with duplicates (full duplicate handling: there might be more duplicates than fit on a page). If you want to test your program and your implementation can not handle duplicates, then you have to change the tests in the file btreetest.cpp to make sure that no duplicate records are inserted.

Introduction
A B+-tree index consists of a collection of records of the form (key, rid), where key is a value for the search key of the index, and rid is the record id of the record being indexed. (It is an index based on what is described as Alternative 2 in the textbook.)

The nodes in a B+-tree can be divided into two categories: internal index nodes and leaf nodes. The leaf nodes store the actual (key, rid) pairs, while the internal nodes store key values and the page identifiers (pageids) of their children.

You should read the textbook carefully and get familiarized with the B+ Tree indexing structure and algorithm.

The space manager
In the last assignment, you were introduced to the Database (DB), Page and Buffer Manager (BufMgr) abstractions. As you may recall, the DB abstraction is responsible for maintaining files, allocation/deallocation of pages and performing I/O. The Page abstraction is just a dummy class of size MAX_SIZE, and the BufMgr is responsible for maintaining "frequently" used pages in memory. In this assignment we are going to build a B+ Tree on top of these abstractions.

The rest of this section explains the space manager: an abstraction that is responsible for keeping records in a page. Since you are going to use this abstraction in this assignment and the next assignment, we recommend that you study it carefully.

HeapPage, Page ID and Record ID
A heap page maintains records using a slot directory located at the beginning of the page. Each slot contains two attributes: offset and length. Records starts from the end of the page. Offset refers to the offset of the record within the page, and length refers to the length of the record.

Each page is uniquely identified by a page id. A record id is a pair (page id, slot number) that uniquely identifies a record.

In MINIBASE, a heap page is implemented by the class HeapPage, which is a subclass of Page. The identifier of a page, the page id, is implemented as type PageID, which is just typedef as int. The identifier of a record, the record id, is implemented as a structure RecordID, containing two fields : a PageID pageNo, and an integer slotNo. Comparison operators == and != are also implemented on RecordID.

The class HeapPage also has a member called type. The type of a page is used in this assignment to indicate the category of the current HeapPage object (whether it is a leaf node or an index node of the B+ Tree).

HeapPage class provides interface to insert and delete records from the page, via InsertRecord and DeleteRecord(). It also provides a function to scan the page. This is done using FirstRecord() and NextRecord().

SortedPage
A SortedPage is a special type of HeapPage that maintains records in a page in increasing sorted order based on a key. It assumes that the first 4 bytes in a record are a key of integer type. Insertion sort is used to insert the records, so the records in a HeapPage may be moved whenever insertion occurs. This may change the RecordID of the records. For this reason, the RecordID of records in a SortedPage should never be exposed to users.

Buffer Manager
You should be very familiar with the Buffer Manager right now. In this assignment you will have a chance to use what you have implemented in the previous assignment.

You will use many Pin and Unpin operations in this assignment. Use Pin() when you have a PageID, and you want to read in the page from disk into memory. After you use the page, you must remember to Unpin() it with the correct dirty flag.

The buffer manager will also give you a new page (using NewPage()) when ever you need one (e.g. when you create a B+ Tree). Remember to free the page when you do not need it anymore (e.g. where you want to delete the B+ Tree, or if the page is deleted from the B+ Tree), by calling FreePage().

Implementation of B+ tree
A B+ Tree index is stored in MINIBASE as a database file.

The class IndexFile is an abstract class for indices. BTreeFile is a subclass of IndexFile, and implements the B+-tree in MINIBASE. The class BTreeFile maintains two types of nodes: BTLeafPage and BTIndexPage, which correspond to the leaf nodes and index (internal) nodes in a B+-tree.

The BTreeFile contains provides methods for inserting, deleting, and searching the tree. It does not contain the tree itself. It maintains the tree by keeping a pointer to the root of the tree. The BTreeFile class is responsible for maintaining the integrity of the tree. (i.e. when an insertion is made, it has to ensure that the new record is inserted in the correct place, or if a new entry is added to an index node, it has to ensure that there is sufficient room in the index node to hold the new entry)

The leaf and index nodes in a B+ Tree are implemented by the classes BTLeafPage and  BTIndexPage, respectively. They are both subclasses of SortedPage and are responsible for inserting into and deleting from leaf node and index node, respectively. These two classes are not responsible for the integrity of the tree.

For this assignment, the BTLeafPage and BTIndexPage are provided for you. We recommend that you study the source code for these two classes carefully. Their implementation can be found in btleaf.cpp and btindex.cpp. You are free to add any methods to these classes that can help with your assignment (try to keep the methods private if you can). For example, a method that finds out which child link to follow when searching for a record, and a method that finds the sibling of a node could be useful. Do not add any member variables to these classes, since they have to map carefully onto the structure of a page.

BTLeafPage
Each indexing entry that we insert or delete from a BTreeFile is a tuple (key, record id). These tuples are stored as records of type LeafEntry in BTLeafPage. To facilitate a scan through the leaf nodes, the leaf nodes are linked together as a doubly linked list. These "pointers" in the link list are actually PageID of the previous page and next page. The two members corresponding to these pointers are called nextPage and prevPage. They can be set and retreived with SetNextPage(), SetPrevPage(), GetNextPage() and GetPrevPage().

Other functions provided for BTLeafPage are Insert and Delete, which (as the name suggest) insert and delete a LeafEntry to/from the page. GetNext(), GetFirst() and GetCurrent() are also provided to scan through the records in a BTLeafPage. A typical use of these functions is shown below.

    int key;
    RecordID rid, outRid;
    Status s = page->GetFirst(key, rid, outRid);
    while (s != DONE) 
    {
        // do something with key and rid
        s = page->GetNext(key, rid, outRid);
    }

BTIndexPage
The BTIndexPage contains a sequence <pid1, key1, pid2, key2, ..., keyk, pidk+1> (the semantics of this sequence are those that are describe in the textbook). The prevPage member in BTIndexPage (also referred to as the leftlink or the leftmost child page) is used to store pid1 while (keyi, pidi+1) are stored as records of type IndexEntry in the BTIndexPage.

Just like in BTLeafPage, Insert(), Delete(), GetFirst() and GetNext() are provided for this class.

Scanning of a B+ Tree
We can also open a scan on a B+ Tree, and retrieve all records within a range of key values. The BTreeFilescan class is a data structure to keep track of the current cursor in the scan. It provides a GetNext() method for retrieving the next record in the BTreeFile We can also delete the current record at the cursor while we are scanning, using the DeleteCurrent() method.

Assignment requirements
In this assignment, you are required to implement two classes: BTreeFile and BTFileScan, link them with the main test program, and make sure that all test programs run successfully. (Note that a successful run does not mean that your program is correct.)

To simplify the assignment, we assume that a BTreeFile only handles integer keys, and overflow or underflow in the B+ tree is handled using the split and merge algorithm. (You are not required to implement redistribution). However, keys may not be unique. But you can assume that all entries with the same key fit into one page.

The details for each of the functions is provided below. We also provide some sample code to illustrate the use of the classes we provide. However these samples are extremely simplified and do not reflect the actual coding that we expect from you. For example, we do not show any error checking in these samples. You are expected to write robust programs by checking the return code and signal errors when necessary.

BTreeFile::BTreeFile
The constructor for the BTreeFile takes in a filename, and checks if a file with that name already exists in the database. If the file exists, we "open" the file. Otherwise we create a new file with that name.

You can use MINIBASE_DB::GetFileEntry() to check if the file exists. A file entry is a pair (filename, pid), where pid is the page id of the "first" page of the file. GetFileEntry() takes in a filename and returns the page id of the first page. In our case, the page id returns should corresponds to the page id of the root note. If no such file is found in the database, GetFileEntry() returns FAIL.

Once we get the page id of the root, we "open" the file by pinning the first page, this will caused the root to be read from disk into memory. The following example shows how to read an existing file and pin the first page.

    // filename contains the name of the BTreeFile to be opened
    Page *page;
    MINIBASE_DB->GetFileEntry(filename, pid);
    MINIBASE_BM->PinPage(pid, page);

To create a new B+ tree index, you should first get a new page, and initialize it. You should also use MINIBASE::AddFileEntry to add a new file entry into the database. The following example shows how to do this :

    Page *page;
    MINIBASE_BM->NewPage(pid, page);
    MINIBASE_DB->AddFileEntry(filename, pid);
    ((SortedPage *)page)->Init(pid);

After the above, you should also initialize the type of the page to the appropiate type (is it LEAF_NODE or INDEX_NODE ?).

    ((SortedPage *)page)->SetType(???);

BTreeFile::~BTreeFile
The destructor of BTreeFile just "closes" the index. This includes unpinning any pages that are being pinned. Note that it does not delete the file.

BTreeFile::DestroyFile
This method deletes the entire index file from the database. You need to free all pages allocated for this index file. The file entry in the database also needs to be removed (using MINIBASE_DB::DeleteFileEntry(filename)).

BTreeFile::Insert
This method inserts a pair (key, rid) into the B+ Tree Index. The actual pair (key, rid) is inserted into a leaf node. But this insertion may cause one or more (key, pid) pair to be inserted into index nodes. You should always check to see if the current node have enough space before you insert. If you don't have enough space, you have to split the current node by creating a new node, and copy some of the data over from the current node to the new node. (If you are implementing duplicate handling, you also have to make sure that entries with the same key remain on the same node, unless there are more duplicate entries than fit on one node. Therefore, in this special case of handling duplicates, we might end up with a node that is less than half full after the insertion.) Splitting will cause a new entry to be added in the parent node.

Splitting of the root node should be considered separately, since if we have a new root, we may need to update the file entry to reflect the changes. Splitting of a leaf node should also be considered separately since the leaf nodes are linked as a link list.

Due to the complexity of this function, we recommend that you write separate functions for different cases. For example, it is a good idea to write a function to insert into a leaf node, and a function to insert into an index node. The following shows some simplified code fragment that may be helpful.

Checking if there is enough space to insert a record into a leaf node :

    if (page->AvailableSpace() ...

Inserting a pair (key, rid) into a leaf node with page ID pid can be done with the following code :

    BTLeafPage *page;
    RecordID outRid;
    MINIBASE_BM->PinPage(pid, (Page *&)page);
    page->Insert(key, rid, outRid);
    MINIBASE_BM->UnpinPage(pid, DIRTY);

BTreeFile::Delete
This method deletes an entry (key, rid) from a leaf node. Deletion from a leaf node may cause one or more entries in the index node to be deleted. You should always check if a node becomes underflowed (less than 50% full) after deletion. If a node becomes underflowed, merging may occur. We can merge if the current node and one of its siblings (either left or right) are both undeflowed. After we merge, some entries in the parent node will need to be deleted (how do you find out which entry ?). Note that you can only merge with a sibling, where a sibling is a node that has the same parent. Note also, that you might not be able to merge, because both siblings are not less than 50% full. You do not have to implement redistribution.

You should consider different scenarios separately (maybe write separate functions for them). You should consider deletion from a leaf node and index node separately. Deletion from the root should also be consider separately (what happens if the root becomes empty after some deletion, but there are still some child nodes?)

The following code fragment may be helpful :

Checking if a node is half full :

    if (page->AvailableSpace() > HEAPPAGE_DATA_SIZE/2) 
    {
        // check if merge can occur
        // merge
    }

BTreeFile::OpenScan
This method should new a BTreeFileScan object and initialize it based on the search range specified. It is useful to find out which leaf node the first record to scan is in.

BTreeFile::DumpStatistics
In this assignment, you are required to collect statistics to reflect the performance of your B+ Tree implementation. This method should print out the following statistics of your B+ Tree.

  1. Total number of nodes in the tree
  2. Total number of data entries in the tree
  3. Total number of index entries in the tree
  4. Average, minimum and maximum fill factor (used space/total space) of leaf nodes and index nodes.
  5. Height of tree

These statistics should serve you in making sure that your code executes correctly. For example, the fill factor of each node should mostly be greater than 0.5, and the height of the tree should be less than 5 or so. (this will vary, of course). You should make sure that DumpStatistics performs this operation.

BTreeFile::Print, BTreeFile::PrintTree, BTreeFile::PrintNode
These are helper functions that should help you debug, by showing the tree contents. However, make sure that you study the code provided carefully, since we introduced a bug. Fix the bug and use it to view your B+ Tree. You are free to modify this function to suit your own debugging requirements.

BTreeFileScan::~BTreeFileScan
First, note that BTreeFileScan does not have a constructor since it is created inside BTreeFile::OpenScan. The destructor of BTreeFileScan should clean up the necessary things (such as unpinning any pinned pages).

BTreeFileScan::GetNext
GetNext returns the next record in the B+ Tree in increasing order of key. Basically, GetNext traverses through the link list of leaf nodes, and returns all the records in them that are within the scan range, one at a time.

For example, if the B+ Tree contains records with keys 1,2,4,5,6,7,8, the following code should print out "2 4 5 ".

    int low = 2;
    int high = 5;
    RecordID rid;
    int key;
    scan = (BTreeFileScan *)tree->OpenScan(&low, &high);
    scan->GetNext(rid, key);
    cout << key << " "; scan->GetNext(rid, key);
    cout << key << " "; scan->GetNext(rid, key);
    cout << key << " "; 

BTreeFileScan::DeleteCurrent
DeleteCurrent should delete the current record, i.e, the record returned by previous call to GetNext(). For example, if the B+ Tree contains records with keys 1,2,4,5,6,7,8, the following code should print out "2 4", and the remaining B+ Tree should contain records with keys 1,5,6,7,8.

    int low = 2;
    int high = 5;
    RecordID rid;
    int key;
    scan = (BTreeFileScan *)tree->OpenScan(&low, &high);
    scan->GetNext(rid, key);
    cout << key << " ";         // should output 2
    scan->DeleteCurrent();      // record with key 2 is deleted
    scan->GetNext(rid, key);
    cout << key << " ";         // should print 4
    scan->DeleteCurrent();      // record with key 4 is deleted
    scan->DeleteCurrent();      // error since record with key 4 is already deleted

Documentation
You should also submit on-line documentation using any format you like, in it you should explain the code that you have written. This should include assumptions that you made, descriptions of any new classes that you have added, and any other special feature we should take note of. As a guideline, the document should be 2-3 pages long (with normal fonts and spacing), or more, if you feel necessary. If you did not complete the assignment, explain the part that you completed so that we can give partial credits (e.g. our deletion works, except that it will cause your program to fail if an index entry is deleted from the root node.) Show sample test runs that you have used to ensure your code is correct, and explain why they are meaningful.

Files
The source code for the project is located in the directory \\goose.csuglab.cornell.edu\courses\cs432-fall01\A3\code. The directory contains, among other files:

btfile.cpp/btfile.h code skeleton for the BTreeFile class
btleaf.cpp/btleaf.h code for BTLeafPage class
btindex.cpp/btindex.h code for the BTIndexPage class
btfilescan.cpp/btfilescan.h code skeleton for BTreeFileScan class
btreetest.cpp/btreetest.h code for test-script interface to the B+-tree
sortedpage.cpp/sortedpage.h code for SortedPage class
bt.h general declaration of types
main.cpp contains main(), runs tests

The definition of public interface for the BTreeFile, BTreeFileScan classes are available, but not fully implemented. BTLeafPage, BTIndexPage and SortedPage are implemented while IndexFile and IndexFileScan are defined, but no implementation is needed.

Just copy the entire code directory to your workspace directory and double-click on btree.dsw. You should now be able to build the project. The files you need to modify are: BTreeFile.cpp and BTreeFileScan.cpp.

Reference
You can find out more about the classes and types here, but the code is always your best source. Use the Visual code browser to look at the code.

User Interface
The interface is either interactive or script based. For a list of commands accepted, execute "btree ?".

A file named rim.txt has been provided as a sample, but it is far from a complete test set.

Below are the commands accepted. Note that when <low> and/or <high> equals -1, they mean min and max, respectively.

insert <low> <high> Insert <high>-<low>+1 random records, within those bounds
scan <low> <high> Scan for <high> <low> records.
delete <low> <high> Delete records between <high> <low>
deletescan <low> <high> Scan and deleteCurrent records between <high> <low>.
No merge expected.
print Print B+ tree.
stats Show stats
quit (or EOF) Termination

Coding convention
You need to follow certain coding conventions for all your assignments.

Submission procedure

How to hand in: You are a group of two students. Drop your project file, source code, executable, and documentation into one of your handin directories in \\goose\courses\cs432-fall01\handinA3. So if students with netids "abc1" and "xyz2" form a group together, one of the two handin directories (for example the directory \\goose\courses\cs432-fal01\handinA3\abc1) will contain the final project of the group. Make sure that the executable is  ready to run, and that it can be recompiled as needed.

You should keep a copy of the project in your own account. No late submission will be accepted.

VERY IMPORTANT Advice

Marking criteria
We will mark your programs based on the following criteria :

You may choose to use your implementation of the buffer manager, or take our implementation.

Miscellaneous

Please let us know if there is any ambiguity in the assignment and please report any possible bugs as soon as possible by mailing the instructor (johannes@cs.cornell.edu).